Abstract
This project applies simple linear regression (SLR) to real data using R. I will work with a dataset containing information about employees of a company, including their years of experience and current salary. The data comes from an online sample dataset of employees and salaries. Using the textbook and the material from all the labs covered so far, I will process and analyze the data to see how salary changes with an employee's years of experience, and to check whether years of experience even matters for a raise in salary. The goal is to apply SLR to examine the relationship between these two variables.

Prithviraj Kadiyala
The following was taken from Forbes Articles
There has been a lot of buzz in the software industry about unequal pay between employees who have worked at a single company for a long time and recent graduates who start at very good salaries. Many articles have also been written about the pay gap between loyal employees and those who change jobs every 3-4 years, with the job-changers getting raises of almost 50%.
Those very new to the tech industry, with less than a year of experience, can expect to earn $50,321 (a year-over-year increase of 9.8 percent). After a year or two, that average salary jumps to $62,517 (a whopping 24.3 percent increase, year-over-year).
Spend three to five years, and the average leaps yet again, to $68,040 (a 6.3 percent increase). Between six and ten years in the industry, salaries hit $83,143 (a rise of 6.8 percent).
Breaking the ten-year mark translates into big bucks. Those with 11 to 15 years of experience could expect to pull down $96,792 (a 3.8 percent increase over last year), while those with more than 15 years average $115,399 (a 6 percent increase).
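The jump from one experience bracket to the next can be computed directly from the quoted averages. A small R sketch (the bracket labels and salaries are the figures quoted above; note that the percentages in the article are year-over-year changes against last year's survey, not bracket-to-bracket jumps, so these numbers will differ):

```r
# Average salaries by experience bracket, from the quoted figures
brackets   <- c("<1 yr", "1-2 yrs", "3-5 yrs", "6-10 yrs", "11-15 yrs", "15+ yrs")
avg_salary <- c(50321, 62517, 68040, 83143, 96792, 115399)

# Percent jump from each bracket to the next
jump <- round(100 * diff(avg_salary) / head(avg_salary, -1), 1)
names(jump) <- paste(head(brackets, -1), "->", tail(brackets, -1))
jump
```

Every bracket-to-bracket jump is positive, which is the same qualitative pattern the regression below looks for: salary rising with experience.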
Below is a graph showing the salary hike employees get when they jump companies:
The data is read in below:
# Read in the employee salary data
dataset = read.csv("Emp_Salary.csv", header = TRUE, sep = ",")
head(dataset)   # first six rows
names(dataset)  # variable names
## [1] "Employee" "EducLev" "JobGrade" "YrsExper" "Age" "Gender"
## [7] "YrsPrior" "PCJob" "Salary"
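Before modeling, it is worth checking the dimensions of the data and whether any values are missing. A minimal sketch, using a small made-up data frame with a few of the column names shown above (the real CSV is not included here, so the values are hypothetical):

```r
# Toy stand-in for the real dataset; Employee, YrsExper, and Salary
# match column names from the output above, values are made up
toy <- data.frame(
  Employee = 1:5,
  YrsExper = c(1, 3, 6, 10, 15),
  Salary   = c(32000, 38000, 45000, 52000, 61000)
)

dim(toy)              # number of rows and columns
sum(is.na(toy))       # total count of missing values
summary(toy$Salary)   # five-number summary plus the mean
```

On the real data, the same three calls confirm that all 9 variables loaded and flag any missing salaries before fitting the model.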
library(s20x)
## Warning: package 's20x' was built under R version 3.4.4
pairs20x(dataset)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
g = ggplot(dataset, aes(x = YrsExper, y = Salary, color = EducLev)) + geom_point()
g = g + xlab("Years of Experience")
g = g + geom_smooth(method = "loess")
g
With the prospect of working in the software industry in the future, it would be really cool to analyze how the IT industry works beforehand. Being prepared for what to do, and when to do it, given the circumstances can put me in a really good position for entering the market and negotiating a higher base salary package.
The following function was taken from https://rpubs.com/therimalaya/43190
trendscatter(YrsExper~Salary,f=0.5,data=dataset)
dataset.lm=lm(YrsExper~Salary,data=dataset)
summary(dataset.lm)
##
## Call:
## lm(formula = YrsExper ~ Salary, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.232 -4.333 -1.343 3.625 27.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.584e+00 1.414e+00 -3.95 0.000107 ***
## Salary 3.822e-04 3.409e-05 11.21 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.52 on 206 degrees of freedom
## Multiple R-squared: 0.379, Adjusted R-squared: 0.376
## F-statistic: 125.7 on 1 and 206 DF, p-value: < 2.2e-16
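The pieces of this summary fit together arithmetically: each t value is the estimate divided by its standard error, and with a single predictor the F statistic is the square of the slope's t value. A quick check using the printed values:

```r
# Values taken from the summary output above
est <- 3.822e-04   # slope estimate for Salary
se  <- 3.409e-05   # its standard error

t_val <- est / se
round(t_val, 2)    # reproduces the printed t value, 11.21

round(t_val^2, 1)  # reproduces the printed F statistic, 125.7
```

This is a handy sanity check when reading any lm() summary: if the t value and F statistic do not agree this way in a one-predictor model, something was transcribed wrong.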
normcheck(dataset.lm,shapiro.wilk = TRUE)
The p-value for the Shapiro-Wilk test is approximately 0. The null hypothesis in this case is that the errors are normally distributed:
\[\epsilon_i \sim N(0,\sigma^2)\]
The results of the Shapiro-Wilk test indicate that we have enough evidence to reject the null hypothesis (the p-value is essentially 0, well below the standard 0.05 threshold), leading us to the conclusion that the errors are not normally distributed.
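To see how the test behaves, shapiro.test() from base R (which is presumably what normcheck() reports when shapiro.wilk = TRUE; that is an assumption about s20x internals) can be run on simulated data: draws from a normal distribution typically give a large p-value, while strongly skewed draws give a tiny one. An illustrative sketch:

```r
set.seed(42)

# Genuinely normal errors: the test typically fails to reject normality
p_norm <- shapiro.test(rnorm(200))$p.value

# Heavily right-skewed (exponential) errors: the test rejects decisively
p_skew <- shapiro.test(rexp(200))$p.value

c(normal = p_norm, skewed = p_skew)
```

A p-value near 0, as in our model, behaves like the skewed case: the residuals are far from what normal errors would produce.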
Yrs.res=residuals(dataset.lm)
Yrs.fit=fitted(dataset.lm)
plot(Yrs.fit,Yrs.res, xlab="Fitted", ylab="Residuals", main="Fitted vs Residuals")
trendscatter(Yrs.fit,Yrs.res, xlab="Fitted", ylab="Residuals")
plot(dataset.lm, which =1)
\[R_{adj}^2 = 1-(1-R^2)\frac{n-1}{n-p-1}\]
From the summary output above, \(R_{adj}^2 = 0.376\): about 37.6% of the variability in the response is explained by the model after adjusting for the number of predictors.
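Plugging the values from the summary output into the adjusted-R² formula reproduces the printed figure: \(R^2 = 0.379\), \(n = 208\) observations (206 residual degrees of freedom plus 2 estimated coefficients), and \(p = 1\) predictor:

```r
r2 <- 0.379   # Multiple R-squared from the summary output
n  <- 208     # 206 residual df + 2 estimated coefficients
p  <- 1       # one predictor (Salary)

r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(r2_adj, 3)   # 0.376, matching the summary output
```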
The next steps are ciReg() from s20x, which gives confidence intervals for the model coefficients, and predict(), which gives predictions (with intervals) for new values of the predictor. Each of the plots above should also be interpreted: the pairs plot shows the pairwise relationships among all variables, the trend scatterplots show the overall shape of the relationship, and the fitted-vs-residuals plots check the constant-variance and linearity assumptions.
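A minimal sketch of how interval estimation works on a fitted lm, using a small made-up dataset (salary in $1000s; the base-R confint() is used here as the analogue of s20x's ciReg, and the model is written in the Salary-on-experience direction stated in the abstract):

```r
# Hypothetical toy data: salary (in $1000s) vs years of experience
sim <- data.frame(YrsExper = c(1, 2, 4, 7, 10, 15),
                  Salary   = c(31, 34, 40, 47, 55, 68))

fit <- lm(Salary ~ YrsExper, data = sim)

# 95% confidence intervals for the intercept and slope
confint(fit)

# Predicted salary, with a 95% prediction interval, at 8 years of experience
predict(fit, newdata = data.frame(YrsExper = 8), interval = "prediction")
```

The prediction interval is wider than a confidence interval for the mean because it also accounts for the error variance of an individual new observation.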